A Probabilistic Algorithm for Segmenting Non-Kanji Japanese Strings
Authors
Abstract
We present an algorithm for segmenting unrestricted Japanese text that is able to detect up to 98% of the words in a corpus. The segmentation technique, which is simple and extremely fast, does not depend on a lexicon or any formal notion of what a word is in Japanese, and the training procedure does not require annotated text of any kind. Relying almost exclusively on character type information and a table of hiragana bigram frequencies, the algorithm decides whether or not to create a word boundary. This method divides strings of Japanese characters into units that are computationally tractable and that can also be justified on lexical and syntactic grounds. Part-of-speech taggers have been used to obtain information about the lexical, syntactic, and some semantic properties of large corpora. Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora.
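The abstract names two ingredients, character type information and a table of hiragana bigram frequencies, without giving the decision rule. The following is a minimal sketch of how such a segmenter might be put together, not the authors' algorithm: the function names, the Unicode ranges used for script classes, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter


def char_type(ch: str) -> str:
    """Coarse script class of one character, by Unicode block (assumed ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def train_bigrams(lines):
    """Count adjacent hiragana pairs in raw, unannotated text."""
    counts = Counter()
    for line in lines:
        for a, b in zip(line, line[1:]):
            if char_type(a) == char_type(b) == "hiragana":
                counts[a + b] += 1
    return counts


def segment(text, bigrams, min_count=2):
    """Split where the script class changes or a hiragana bigram is rare.

    min_count is a hypothetical threshold, not a value from the paper.
    """
    if not text:
        return []
    units, current = [], text[0]
    for prev, ch in zip(text, text[1:]):
        t_prev, t_ch = char_type(prev), char_type(ch)
        if t_prev != t_ch:
            boundary = True  # script change, e.g. kanji followed by hiragana
        elif t_ch == "hiragana":
            boundary = bigrams[prev + ch] < min_count  # rarely seen pair
        else:
            boundary = False  # keep kanji, katakana, and other runs together
        if boundary:
            units.append(current)
            current = ch
        else:
            current += ch
    units.append(current)
    return units
```

With a hand-built table such as `Counter({"から": 5})`, calling `segment("東京から来た", table)` keeps the kanji run and the frequent hiragana pair together and splits at each script change. A real system would need a more careful rule for hiragana that continues a kanji stem, which this sketch always splits off.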
Similar resources
Segmenting Sentences into Linky Strings Using D-bigram Statistics
It is obvious that segmentation plays an important role in natural language processing (NLP), especially for languages whose sentences are not easily separated into morphemes. In this study we propose a method of segmenting a sentence. The system described in this paper does not use any grammatical information or knowledge in processing. Instead, it uses statistical information drawn from n...
Acquired Dyslexia in Japanese: Implications for Reading Theory
Acquired dyslexia research has been conducted mainly on English neurological patients. A limited number of dyslexia studies on non-alphabetic orthographies are available. Classical case studies for acquired dyslexia in Japanese, which has two distinctive scripts (morphographic Kanji and phonographic Kana), reported 'script-dependent' dyslexia patterns. Although recent case studies showed 'scrip...
TR99-1756: Unsupervised Statistical Segmentation of Japanese Kanji Strings
Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of ...
Kana-Kanji Conversion System with Input Support Based on Prediction
1 Introduction. TOSHIBA developed the world's first Japanese word processor in 1978. Unlike languages based on an alphabet, Japanese uses thousands of kanji characters of varying complexity. Hence, arranging all of the kanji characters on a keyboard is difficult. On the other hand, kana characters, which are phonetic scripts of Japanese, have 83 variations; these can be arranged o...
Normal and impaired reading of Japanese kanji and kana
Two kinds of scripts are used in the written forms of Japanese words: morphographic kanji and phonographic kana. Whereas each kana character invariably represents a single pronunciation, the majority of kanji characters have two or more legitimate pronunciations, with one appropriate to the character in any given word. Furthermore, each kanji character has meaning while a kana character does no...
Publication date: 1994